Minimal entropy probability paths between genome families.

Authors

  • Calvin Ahlbrandt
  • Gary Benson
  • William Casey
Abstract

We develop a metric for probability distributions with applications to biological sequence analysis. Our distance metric is obtained by minimizing a functional defined on the class of paths over probability measures on N categories. The underlying mathematical theory is connected to a constrained problem in the calculus of variations. The solution presented is numerical; it approximates the true solution for a set of cases called rich paths, in which none of the components of the path is zero. The functional to be minimized is motivated by entropy considerations, reflecting the idea that nature might carry out mutations of genome sequences efficiently, in such a way that the increase in entropy involved in the transformation is as small as possible. We characterize sequences by frequency profiles or probability vectors; in the case of DNA, N is 4 and the components of the probability vector are the frequencies of occurrence of the bases A, C, G and T. Given two probability vectors a and b, we define a distance function as the infimum of path integrals of the entropy function H(p) over all admissible paths p(t), 0 ≤ t ≤ 1, with p(t) a probability vector such that p(0) = a and p(1) = b. If the probability paths p(t) are parameterized as y(s) in terms of arc length s and the optimal path is smooth with arc length L, then smooth and "rich" optimal probability paths may be estimated numerically by a hybrid method. The method iterates Newton's method on solutions of a two-point boundary value problem, with unknown distance L between the abscissas, for the Euler-Lagrange equations resulting from a multiplier rule for the constrained optimization problem, and uses linear regression to improve the arc length estimate L. Matlab code for these numerical methods is provided; it works only for "rich" optimal probability vectors.
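In symbols, the definition above can be sketched as follows. This is an illustrative reading rather than the authors' own notation: it assumes that H is the Shannon entropy and that the path integral is taken with respect to Euclidean arc length. The symbols d, \Delta_N and \dot p are introduced here for illustration only.

```latex
\[
d(a,b) \;=\; \inf_{p}\int_{0}^{1} H\bigl(p(t)\bigr)\,\lVert \dot p(t)\rVert_{2}\,dt,
\qquad
H(p) \;=\; -\sum_{i=1}^{N} p_i \log p_i ,
\]
\[
p(t)\in\Delta_N=\Bigl\{x\in\mathbb{R}^N : x_i\ge 0,\ \sum_{i=1}^{N}x_i=1\Bigr\},
\qquad p(0)=a,\quad p(1)=b .
\]
% Under the arc-length parameterization y(s), 0 <= s <= L, the same integral is
\[
\int_{0}^{L} H\bigl(y(s)\bigr)\,ds .
\]
```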
These methods motivate a definition of an elementary distance function which is easier and faster to calculate, works on non-rich vectors, and involves neither variational theory nor differential equations, yet is a better approximation of the minimal entropy path distance than the distance ‖b − a‖₂. We compute minimal entropy distance matrices for examples of DNA myostatin genes and amino-acid sequences across several species. Output tree dendrograms for our minimal entropy metric are compared with dendrograms based on BLAST and BLAST identity scores.
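As a rough illustration of the quantities involved, the sketch below builds frequency profiles from two DNA strings and integrates the entropy along the straight segment joining them. It is not the authors' Matlab code, nor their elementary distance function, which the abstract does not define. It assumes that H is the Shannon entropy (in nats) and that the path integral is taken with respect to Euclidean arc length. Under those assumptions the straight segment is an admissible path, so its integral is only an upper bound on the minimal entropy distance. All function names and the example sequences are hypothetical.

```python
import math
from collections import Counter


def frequency_profile(seq, alphabet="ACGT"):
    """Return the probability vector of symbol frequencies in seq."""
    counts = Counter(ch for ch in seq.upper() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no symbols from the alphabet")
    return [counts[ch] / total for ch in alphabet]


def entropy(p):
    """Shannon entropy in nats, with 0*log(0) taken as 0 (assumed form of H)."""
    return -sum(x * math.log(x) for x in p if x > 0)


def straight_line_entropy_length(a, b, steps=1000):
    """Integrate H along the segment from a to b w.r.t. arc length.

    Uses the midpoint rule; the segment has constant speed ||b - a||_2.
    """
    length = math.sqrt(sum((y - x) ** 2 for x, y in zip(a, b)))
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) / steps
        point = [x + t * (y - x) for x, y in zip(a, b)]
        total += entropy(point)
    return length * total / steps


# Hypothetical toy sequences, not data from the paper.
a = frequency_profile("ACGTACGTAACC")
b = frequency_profile("GGGTTTACGTGT")
upper_bound = straight_line_entropy_length(a, b)
```

Because the entropy on N categories never exceeds log N, this value is at most ‖b − a‖₂ · log N; it also works for non-rich vectors, since zero components contribute nothing to the entropy.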

Similar resources

Risk measurement and Implied volatility under Minimal Entropy Martingale Measure for Levy process

This paper focuses on two main issues that are based on two important concepts: exponential Levy process and minimal entropy martingale measure. First, we intend to obtain risk measurement such as value-at-risk (VaR) and conditional value-at-risk (CVaR) using the Monte Carlo method under minimal entropy martingale measure (MEMM) for exponential Levy process. This Martingale measure is used for the...

The Sum-Over-Paths Covariance: A novel covariance measure between nodes of a graph

This work introduces a link-based covariance measure between the nodes of a weighted, directed graph where a cost is associated with each arc. To this end, a probability distribution on the (usually infinite) set of paths through the network is defined by minimizing the sum of the expected costs between all pairs of nodes while fixing the total relative entropy spread in the network. This result...

Maximum entropy change and least action principle for nonequilibrium systems

A path information is defined in connection with the different possible paths of irregular dynamic systems moving in their phase space between two points. On the basis of the assumption that the paths are physically differentiated by their actions, we show that the maximum path information leads to a path probability distribution in exponentials of action. This means that the most probable paths are ...

Weighted paths between partitions

Developing from a concern in bioinformatics, this paper analyses alternative metrics between partitions. From both theoretical and applicative perspectives, a seemingly most appropriate distance between any two partitions is HD, which counts the number of atoms finer than either one but not both. While faithfully reproducing the traditional Hamming distance between subsets, HD is very sensible ...

A bag-of-paths framework for network data analysis

This work develops a generic framework, called the bag-of-paths (BoP), for link and network data analysis. The central idea is to assign a probability distribution on the set of all paths in a network. More precisely, a Gibbs-Boltzmann distribution is defined over a bag of paths in a network, that is, on a representation that considers all paths independently. We show that, under this distribut...

Journal:
  • Journal of Mathematical Biology

Volume 48, Issue 5

Pages: -

Publication date: 2004